Learning to use tidyverse for data exploration and modelling
2022-05-07
The data come from the National Health and Nutrition Examination Survey (NHANES) and concern glycohemoglobin levels and diabetes mellitus (DM). They are available from https://hbiostat.org/data/.
Why this dataset?
| Variable | Description | Units | Levels |
|---|---|---|---|
| seqn | Unique patient ID | | |
| sex | Gender | | 0, 1 |
| age | Age | Years | 12 - 80 |
| re | Race/ethnicity | | 5 levels: White, Black, Mexican, Other Hispanic, Other |
| income | Family income level | $ | 14 levels from 0 - 100000 |
| tx | On insulin or diabetes meds | | 0, 1 |
| dx | Diagnosed with DM or pre-DM | | 0, 1 |
| wt | Weight | kg | 28 - 239.4 |
| ht | Height | cm | 123.3 - 202.7 |
| bmi | Body-mass index | kg/m^2 | 13.18 - 84.87 |
| leg | Upper leg length | cm | 20.4 - 50.6 |
| arml | Upper arm length | cm | 24.8 - 47 |
| armc | Arm circumference | cm | 16.8 - 61 |
| waist | Waist circumference | cm | 52 - 179 |
| tri | Triceps skinfold thickness | mm | 2.6 - 41.1 |
| sub | Subscapular skinfold thickness | mm | 3.8 - 40.4 |
| gh | Glycohemoglobin | % | 4 - 16.4 |
| albumin | Albumin | g/dL | 2.5 - 5.3 |
| bun | Blood urea nitrogen | mg/dL | 1 - 90 |
| SCr | Serum Creatinine | mg/dL | 0.14 - 15.66 |
The `dx` variable does not differentiate between type I and type II diabetes.
| Variable | Description | Units | Levels |
|---|---|---|---|
| income | Family income level | $ | 14 levels from 0 - 100000 |
Here we replaced missing (NA) values of `income` with the mean of all non-NA values.
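Mean imputation of this kind can be sketched in a few lines of dplyr. The tibble `toy` and its values are made up for illustration; only the column name `income` comes from the table above, and this is not necessarily the exact code used for the slides.

```r
library(dplyr)

# Hypothetical toy data standing in for the survey table
toy <- tibble(
  seqn   = 1:5,
  income = c(20000, NA, 50000, 80000, NA)
)

# Replace each missing income with the mean of the observed incomes
toy_imputed <- toy %>%
  mutate(income = if_else(is.na(income), mean(income, na.rm = TRUE), income))
```

Because every missing value receives the same number, this approach keeps the mean of `income` unchanged but shrinks its variance.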
| Variable | Description | Units | Levels |
|---|---|---|---|
| leg | Upper leg length | cm | 20.4 - 50.6 |
| arml | Upper arm length | cm | 24.8 - 47 |
| armc | Arm circumference | cm | 16.8 - 61 |
| waist | Waist circumference | cm | 52 - 179 |
| tri | Triceps skinfold thickness | mm | 2.6 - 41.1 |
| sub | Subscapular skinfold thickness | mm | 3.8 - 40.4 |
Here we implemented k-nearest neighbours (KNN) with K = 5 in tidyverse. We did not optimize K.
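The slides do not show the implementation, so the following is only a minimal sketch of one way KNN with K = 5 could be applied to a body measurement. It assumes the goal is to fill in missing values, that the predictors are complete, and that Euclidean distance on scaled predictors is acceptable. The helper `knn_impute()` and the tibble `toy` are hypothetical names with made-up values.

```r
library(dplyr)

# Hypothetical toy data: `leg` has one missing value to fill in
toy <- tibble(
  ht  = c(150, 152, 160, 161, 170, 171, 180, 181),
  wt  = c(50, 52, 60, 61, 70, 71, 80, 81),
  leg = c(35, NA, 38, 39, 41, 42, 44, 45)
)

# Fill missing `target` values with the mean of the k nearest observed rows.
# Assumes the predictor columns contain no missing values.
knn_impute <- function(data, target, predictors, k = 5) {
  x <- scale(as.matrix(data[predictors]))  # put predictors on a common scale
  y <- data[[target]]
  donors <- which(!is.na(y))               # rows with an observed target
  for (i in which(is.na(y))) {
    # Euclidean distance from row i to every donor row
    d <- sqrt(rowSums(sweep(x[donors, , drop = FALSE], 2, x[i, ])^2))
    nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
    y[i] <- mean(y[nearest])
  }
  data[[target]] <- y
  data
}

toy_imputed <- knn_impute(toy, "leg", c("ht", "wt"), k = 5)
```

Donor rows are fixed before the loop, so imputed values are never reused as neighbours. With real data, K would normally be tuned, for example by masking known values and comparing the imputation error across several values of K.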
Biochemical variables have more outliers
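One way to compare outlier counts across variables is the 1.5 x IQR rule used by box plots. This rule is an assumption here, since the slides do not say how outliers were identified. The helper `is_outlier()` and the toy values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one anthropometric and one biochemical variable
toy <- tibble(
  ht  = c(150, 155, 160, 165, 170, 175, 180),
  SCr = c(0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 9.5)
)

# Flag values more than 1.5 * IQR outside the quartiles
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Count flagged values in every column
outlier_counts <- toy %>%
  summarise(across(everything(), ~ sum(is_outlier(.x))))
```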
| Variable | Description | Units | Levels |
|---|---|---|---|
| SCr | Serum Creatinine | mg/dL | 0.14 - 15.66 |
The normal range is 0.6 - 1.3 mg/dL; values of 5 mg/dL or more indicate severe kidney impairment. We removed all values above 5 mg/dL (17 values in total).
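A cutoff like this maps onto a single `filter()` call. The sketch below assumes that "removed" means dropping the whole row and that rows with a missing `SCr` are kept. Both are assumptions, and the toy values are made up.

```r
library(dplyr)

# Hypothetical toy data with two values above the 5 mg/dL cutoff
toy <- tibble(
  seqn = 1:6,
  SCr  = c(0.8, 1.1, 5.0, 7.2, NA, 15.66)
)

# Keep rows at or below the cutoff; keep NA rows explicitly,
# because filter() would otherwise drop them
toy_filtered <- toy %>%
  filter(is.na(SCr) | SCr <= 5)
```

If the other measurements in those rows should be retained, the values could instead be set to NA with `mutate(SCr = if_else(SCr > 5, NA_real_, SCr))`.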